Logistic regression
A better approach
Modeling the distributions of Sepal.Length
The smooth curves are normal distributions fitted to the histograms
These are fitted assuming equal standard deviations
Deducing probabilities from histograms and their fits
Dots are ratios of bin counts \(\frac{\text{virginica}}{\text{virginica}+\text{versicolor}}\)
The curve is the ratio of the fitted normal distributions.
The log-odds ratio or logit
The log-odds ratio is the logarithm of the ratio of the probability of being a virginica, \(\theta\) , over that of being a versicolor, \(1-\theta\)
\[
\text{log-odds} = \ln{\left( \frac{\theta}{1-\theta} \right)}
\]
Note that
the range of the log-odds ratio is \((-\infty, \infty)\)
\(\text{log-odds}=0\) when \(\theta = 1 - \theta\) (equal probability)
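The two properties above can be checked with base R, whose `qlogis()` and `plogis()` functions implement the logit and its inverse (the logistic function):

```r
# qlogis() is the log-odds (logit); plogis() is its inverse
theta <- c(0.1, 0.5, 0.9)
qlogis(theta)           # negative, zero, positive: range is (-Inf, Inf)
qlogis(0.5)             # exactly 0 when theta = 1 - theta
plogis(qlogis(theta))   # the inverse recovers the original probabilities
```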
Surprise!
The log-odds ratio of the probability \(\theta\) fitted on the iris data is a straight line.
Idea behind logistic regression
Apparently, the log-odds of the parameter \(\theta(x)\) can be modeled as a straight line, a function of the predictor variable \(x=\) Sepal.Length
\[
\ln{\left( \frac{\theta(x)}{1-\theta(x)} \right)} = \beta_0 + \beta_1 x
\]
Equivalently
\[\theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]
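The equivalence of the two formulations can be illustrated with a short sketch. The coefficient values below are made up for illustration only, not fitted to the iris data:

```r
# Illustrative (not fitted) coefficients
beta0 <- -12; beta1 <- 2
theta <- function(x) 1 / (1 + exp(-(beta0 + beta1 * x)))

# theta(x) is an S-shaped curve between 0 and 1
x <- seq(4, 8, by = 0.1)
plot(x, theta(x), type = "l", ylab = expression(theta(x)))

# Check: the log-odds of theta(x) is again the straight line beta0 + beta1*x
all.equal(log(theta(6) / (1 - theta(6))), beta0 + beta1 * 6)
```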
The full model, including binomial distribution of samples
When observing an iris plant with sepal length \(x\) , we have:
a virginica with probability \(\theta(x)\)
a versicolor with probability \(1 - \theta(x)\)
The probability of observing \(k\) virginica individuals among \(n\) iris plants having a sepal length equal to \(x\) equals:
\[
p(k\text{ virginica }|n) = \text{Binom}(n,k,\theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}
\]
where
\[
\theta = \theta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\]
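The full model can be evaluated directly with `dbinom()`. Again the coefficients are illustrative assumptions, not fitted values:

```r
# Illustrative (not fitted) coefficients
beta0 <- -12; beta1 <- 2
th <- 1 / (1 + exp(-(beta0 + beta1 * 6.0)))  # theta(x) at x = 6.0

# P(k = 3 virginica among n = 5 plants with sepal length 6.0)
dbinom(3, size = 5, prob = th)
```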
Logistic regression in R
Fit a curve of the form \(y = \frac{1}{1 + e^{-f(x)}}\) where \(f(x)\) is a linear function of \(x\) .
The range of \(f(x)\) is \((-\infty, +\infty)\) , but that of \(y\) is \((0,1)\)
log.model <- glm(Species ~ Sepal.Length,
                 family = 'binomial')
How well does it fit the training data?
We classify as follows:
if \(p\leq0.5\) then species = 0 (versicolor)
if \(p>0.5\) then species = 1 (virginica)
Prediction
             Prediction
Species      versicolor virginica
  versicolor         36        14
  virginica          13        37
The training error of this classifier is \((14+13)/100 = 27\%\)
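A sketch of the whole fit-and-evaluate cycle, assuming the slides work on the two-species subset of `iris` (setosa removed):

```r
# Two-species subset: versicolor vs virginica (assumption; setosa dropped)
iris2 <- droplevels(subset(iris, Species != "setosa"))

log.model <- glm(Species ~ Sepal.Length, data = iris2, family = 'binomial')

# Fitted probabilities of the second factor level, virginica
p <- predict(log.model, type = "response")
pred <- ifelse(p > 0.5, "virginica", "versicolor")

table(Species = iris2$Species, Prediction = pred)  # confusion matrix
mean(pred != iris2$Species)                        # training error
```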
But wait, we have more predictors than Sepal.Length
Namely also Sepal.Width, Petal.Width and Petal.Length
full.log.model <- glm(Species ~ Petal.Width +
                        Petal.Length +
                        Sepal.Width +
                        Sepal.Length,
                      family = 'binomial')
The training error equals only 2%!
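The full model can be checked the same way, again assuming the two-species subset `iris2` as above:

```r
iris2 <- droplevels(subset(iris, Species != "setosa"))
full.log.model <- glm(Species ~ Petal.Width + Petal.Length +
                        Sepal.Width + Sepal.Length,
                      data = iris2, family = 'binomial')
p <- predict(full.log.model, type = "response")
mean(ifelse(p > 0.5, "virginica", "versicolor") != iris2$Species)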
The naive Bayes classifier
Using Bayes rule as a classifier
Variables:
\(M\) : mitochondrial signal in the protein: \(y,n\) (yes, no)
\(L\) : location of the protein in the cell, possible values: \(c,m\) (cytoplasm, mitochondrion)
Challenge:
We have assessed the conditional distribution \(P(\text{M}|\text{L})\)
\(P(M = y|L = m) = 0.8\) , \(P(M = y|L = c) = 0.15\) .
A protein is randomly picked from a population with the distribution \(P(L=m) = 0.1\)
The protein drawn carries a mitochondrial signal: \(M = y\)
Predict the location of this protein
Using Bayes rule as a classifier
Plan:
Compare \(P(L=m|M=y)\) to \(P(L=c|M=y)\)
Decision rule
\(L=m\) if \(P(L=m|M=y) > P(L=c|M=y)\) , otherwise \(L=c\)
\[
\begin{align}
P(L=m|M=y) &= \frac{P(M=y|L=m) \cdot P(L=m)}{P(M=y)} \\
&= \frac{0.8 \times 0.1}{P(M=y)} = \frac{0.08}{P(M=y)}
\end{align}
\]
\[
P(L=c|M=y) = \frac{0.135}{P(M=y)} \quad \text{(Show this)}
\]
We can now decide that \(L = c\) . We don’t have to know \(P(M=y)\) !
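The comparison can be written out in a few lines of R, using the numbers from the slides. Normalizing the unnormalized posteriors implicitly supplies \(P(M=y)\), which is why we never need it for the decision:

```r
prior <- c(m = 0.1, c = 0.9)    # P(L = m), P(L = c)
lik   <- c(m = 0.8, c = 0.15)   # P(M = y | L = m), P(M = y | L = c)

post <- lik * prior     # unnormalized posteriors: 0.08 and 0.135
post / sum(post)        # normalized: dividing by P(M = y) = sum(post)
names(which.max(post))  # decision: the location with the larger posterior
```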
We have another, continuous predictor for protein location
Previously we needed to know the discrete distribution \(P(M|L)\) .
What if \(S\) is a continuous variable and we need \(p(S|L)\) ?
Assess \(S\) in proteins with known location
Fit the data using a suitable distribution function
Obtain two conditional probability distribution functions , one for mitochondrial and one for cytoplasmic proteins
Using Bayes rule on a continuous predictor
The protein that we picked earlier has a value \(S = 22\)
What is the location of this protein?
Compare \(p(L=c | S=22)\) to \(p(L=m | S=22)\)
\[
\begin{align}
p(L = c | S = 22) &\propto p(S=22 | L = c) \times P(L = c) \\
&\propto 0.00769 \times 0.9 = 0.00692
\end{align}
\]
\[
\begin{align}
p(L = m | S = 22) &\propto p(S = 22 | L = m) \times P(L = m) \\
&\propto 0.0225 \times 0.1 = 0.00225
\end{align}
\]
We decide (again) that \(L = c\)
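The same comparison in R, with the density values \(p(S=22|L)\) taken from the slides:

```r
dens  <- c(m = 0.0225, c = 0.00769)  # fitted densities p(S = 22 | L)
prior <- c(m = 0.1,    c = 0.9)      # P(L = m), P(L = c)

post <- dens * prior    # unnormalized posteriors: 0.00225 and 0.00692
names(which.max(post))  # decision: "c"
```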
Intermezzo: probability density is not a probability
Note that we are sloppy in the previous slide.
\(p(S|L)\) is a probability density function (of \(S\) ), not a probability.
However, in a very tiny interval \(S = 22 \pm \delta/2\)
\[
P\left( S = 22 \pm \delta/2 | L\right) \approx p(S=22|L) \times \delta
\]
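A quick numerical illustration of this point: a density value can exceed 1, and only density times interval width approximates a probability.

```r
# A normal density with small sd takes values well above 1 at its mode
dnorm(0, mean = 0, sd = 0.1)   # about 3.99 -- clearly not a probability

# density * interval width approximates the probability of a tiny interval
delta <- 0.01
dnorm(0, sd = 0.1) * delta
pnorm(delta / 2, sd = 0.1) - pnorm(-delta / 2, sd = 0.1)  # close agreement
```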
Using Bayes rule with both predictors
By definition
\[
p(M|L) = \frac{p(M,L)}{p(L)} \quad \text{and} \quad p(S|M,L) = \frac{p(L,M,S)}{p(M,L)}
\]
From which we get \(\qquad p(L,M,S) = p(S|M,L) \cdot p(M|L) \cdot p(L)\)
Also, by definition
\[
p(L|M,S) = \frac{p(L,M,S)}{p(M,S)}
\]
By substitution we obtain:
Bayes rule for joint distributions
\[
p(L|M,S) = \frac{p(S|M,L) \cdot p(M|L) \cdot p(L)}{p(M,S)}
\]
But how can we estimate \(p(S|M,L)\) ?
The problem with multiple predictors
\(p(S|M,L)\) is a set of 4 probability density functions (4 combinations in the Cartesian product \(M \times L\) )
With more predictor variables, estimating \(p(X_1|X_2, X_3,\ldots,X_n,L)\) becomes infeasible: the number of predictor-value combinations to condition on grows exponentially
Solution: the naive assumption
All predictor variables are independent conditional on \(L\)
For example \(p(S|M,L) = p(S|L)\) : \(S\) does not depend on \(M\) , in the sub-populations of \(L\)
Applying the naive assumption we get
\[
p(L|M,S) = \frac{p(S|L) \cdot p(M|L) \cdot p(L)}{p(M,S)}
\]
In case of our example
Compare \(p(L=m|S=22,M=y)\) to \(p(L=c|S=22,M=y)\)
\[
\begin{align}
& p(L = m | S=22,M = y) \\
& \propto p(S=22|L = m) \times p(M = y | L = m) \times p(L = m) \\
& = 0.0225 \times 0.8 \times 0.1 = 0.0018
\end{align}
\]
\[
\begin{align}
& p(L = c | S = 22, M = y) \\
& \propto p(S=22|L = c) \times p(M = y | L = c) \times p(L = c) \\
& = 0.00769 \times 0.15 \times 0.9 = 0.00104
\end{align}
\]
Since \(0.0018 > 0.00104\), this time we decide \(L = m\) : combining both predictors flips the decision.
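The naive-Bayes combination of both predictors, with all values taken from the slides:

```r
prior  <- c(m = 0.1,    c = 0.9)      # P(L)
lik_M  <- c(m = 0.8,    c = 0.15)     # P(M = y | L)
dens_S <- c(m = 0.0225, c = 0.00769)  # p(S = 22 | L)

# Naive assumption: multiply the per-predictor factors
post <- dens_S * lik_M * prior  # unnormalized: 0.0018 and 0.00104
names(which.max(post))          # decision: "m"
```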
Causal scheme for the naive assumption
\[M \perp S|L\]
For a given location, S and M are independent variables
All correlation between S and M is explained by the location
Naive assumption in general
Every pair of predictors is independent conditional on the response
Conclusion
Despite (because of?) the naive assumption, naive Bayes classifiers often perform extremely well
See example on the iris data set in the syllabus
Prior distributions can be modified when desired (here \(p(L)\) )
They are used a lot as a simple type of classifier
Use them only as classifiers, not to predict distributions over classes!